Finding backdoors mechanistically is NP-hard.

I wrote the hardness proof below for a specific type of backdoor detection. However, I realized that the context of this proof matters a lot more than the proof itself.

The context: Types of backdoor detection problems

Firstly, when I say backdoors, I also mean (to some extent) adversarial inputs and deceptive alignment. All of these involve some rare but bad behavior, and I haven’t seen anyone cleanly separate these terms. Any backdoor I insert with fine-tuning could’ve been coincidentally inserted by regular training on the training data. We call the latter case an adversarial input, but the only difference is this fuzzy concept of intentionality. I could also insert a backdoor that makes the AI try to kill all humans if and only if we’re in $CURRENT_YEAR + 1 (like Hubinger et al. (2024)). If this backdoor was inserted by training data, it’d look like deceptive alignment.

An interesting formalization that comes close to separating deceptive alignment from adversarial inputs is from Christiano et al. (2024). They consider the case where the adversary doesn’t get to choose the backdoor trigger. In real deceptive alignment, the possible backdoor triggers are constrained to be opportune moments to kill all humans, and the bad behavior is actually trying to kill them.

However, I’m not going to use Christiano’s formalism. The question I’ll ask is whether some bad output is possible. For instance, a dog classifier has no adversarial inputs or backdoors if it’s unlikely that any input similar to a dog image in the dataset is classified as NOT-DOG.
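In the strict worst-case reading, one way to write that condition down is the following (the dataset \(D\), distance \(\operatorname{dist}\), radius \(\epsilon\), and classifier \(f\) are just my notation, not anything from the proof): \[ \nexists\, x' : \left( \exists x \in D : \operatorname{dist}(x', x) \le \epsilon \right) \wedge f(x') = \text{NOT-DOG} \]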

Under this formalism, there are 3 main problems to solve. Going from computationally hardest to easiest, we have the following:

There are also options for defining what counts as a bad output, but they probably aren’t that different. The definition I’ll use is an output logit above some threshold. If you prefer, you could also imagine it as a linear probe over many output logits.
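Concretely, for an output logit vector \(y\), the condition I have in mind is a threshold on a single logit, or, in the probe picture, a threshold on a fixed linear combination of logits (the probe vector \(w\) here is purely illustrative): \[ y_{\text{bad}} \ge t \qquad \text{or} \qquad w^\top y \ge t \]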

Hardness proof of worst case detection

In this proof, the neural net we’ll work with is a 1 layer MLP with ReLUs (\(\sigma\) is the ReLU function). If detection is NP-hard in this simple 1 layer MLP, then it’s also NP-hard in all models that contain 1 layer MLPs. This is what our MLP looks like: \[ y = C\sigma(Ax + b) + d \] The problem at hand is detecting whether the MLP’s output can exceed some threshold, so it looks more like this (the output bias \(d\) can be absorbed into the threshold \(t\)): \[ \exists x : C \sigma(A x + b) \ge t \]
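To make the object concrete, here’s a minimal numpy sketch of this MLP and the threshold check. The function names and shapes are mine, not from any library; note that the detection problem asks whether *any* \(x\) passes the check, not just one candidate.

```python
import numpy as np

def mlp(x, A, b, C, d):
    """One-hidden-layer ReLU MLP: y = C @ relu(A @ x + b) + d, with C of shape (hidden,) and scalar d."""
    return float(C @ np.maximum(A @ x + b, 0.0) + d)

def exceeds_threshold(x, A, b, C, d, t):
    """Does this particular input reach the threshold?
    Worst-case detection asks whether *some* x does, which is the part that turns out to be NP-hard."""
    return mlp(x, A, b, C, d) >= t
```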

This will be a reduction from 3SAT. First, we need a gadget to enforce discreteness of an input element \(x\). We can set the output threshold to 1 and then craft the weights such that the only way to reach that threshold is to have \(x \in \{+1, -1\}\): \[ -2 \sigma(x-1) + \sigma(x) + \sigma(-x) -2 \sigma(-x-1) \ge 1 \] When \(x\) is positive, only the two left-hand terms contribute, and \(x\) must be at least 1 to reach the threshold. However, the punishing term \(-2\sigma(x-1)\) ensures that pushing \(x\) beyond 1 only decreases the output. The same logic applies when \(x\) is negative.
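A quick numerical check of this gadget (my own sketch, not part of the proof): it tops out at exactly 1, and only at \(x = \pm 1\).

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def discreteness_gadget(x):
    """-2*relu(x-1) + relu(x) + relu(-x) - 2*relu(-x-1): peaks at 1, and only at x = +1 or x = -1."""
    return -2 * relu(x - 1) + relu(x) + relu(-x) - 2 * relu(-x - 1)

xs = np.linspace(-3, 3, 601)                                  # grid that includes -1 and +1
vals = discreteness_gadget(xs)
assert np.isclose(vals.max(), 1.0)                            # the threshold 1 is reachable...
assert np.allclose(xs[np.isclose(vals, 1.0)], [-1.0, 1.0])    # ...but only at x = -1 and x = +1
```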

This gadget composes easily to enforce discreteness of many inputs: \[ \sum_{i=1}^n \left( -2 \sigma(x_i-1) + \sigma(x_i) + \sigma(-x_i) -2 \sigma(-x_i-1) \right) \ge n \] Each term is at most 1, so the sum can only reach \(n\) when every \(x_i \in \{+1, -1\}\).
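The same kind of sketch works for the composed version: the sum only hits \(n\) when every coordinate sits at \(\pm 1\).

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def composed_gadget(x):
    """Sum of per-coordinate discreteness gadgets over a vector x; at most len(x)."""
    return np.sum(-2 * relu(x - 1) + relu(x) + relu(-x) - 2 * relu(-x - 1))

rng = np.random.default_rng(0)
n = 5
assert composed_gadget(rng.choice([-1.0, 1.0], size=n)) == n                        # ±1 vectors hit n exactly
assert all(composed_gadget(rng.uniform(-3, 3, size=n)) < n for _ in range(1000))    # generic vectors fall short
```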

Once we have discreteness, the problem becomes easy. We can add a 3SAT clause gadget that only meets its threshold of 0 when the clause is satisfied: \[ - \sigma(\pm x_i \pm x_j \pm x_k - 2) \ge 0 \] The sign convention is that a positive literal \(x_i\) contributes \(-x_i\) and a negated literal \(\neg x_i\) contributes \(+x_i\). Only when \(\pm x_i, \pm x_j, \pm x_k\) are all positive, corresponding to all 3 literals being false, does this go below zero. In the other cases it’s equal to zero.
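A small check of the clause gadget under that sign convention (again my own sketch; literals are written as (variable, polarity) pairs):

```python
import itertools

relu = lambda z: max(z, 0)

def clause_gadget(literals, assignment):
    """-relu(sum of signed inputs - 2): a positive literal x_i contributes -x_i,
    a negated literal contributes +x_i. Equals 0 iff the clause is satisfied, else -1."""
    pre = sum(-polarity * assignment[var] for var, polarity in literals) - 2
    return -relu(pre)

clause = [(1, +1), (2, +1), (3, -1)]            # the clause (x1 v x2 v ~x3)
for x1, x2, x3 in itertools.product([-1, 1], repeat=3):
    assignment = {1: x1, 2: x2, 3: x3}
    satisfied = (x1 == 1) or (x2 == 1) or (x3 == -1)
    assert clause_gadget(clause, assignment) == (0 if satisfied else -1)
```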

Putting this all together, here’s an example of a 3SAT formula and its representation as a 1 layer MLP: \[ (x_1 \vee x_2 \vee \neg x_3) \wedge (\neg x_1 \vee x_3 \vee x_4) \wedge (x_2 \vee \neg x_3 \vee \neg x_4) \wedge (x_1 \vee \neg x_2 \vee x_4) \]

\[ \left(\sum_{i=1}^4 -2 \sigma(x_i-1) + \sigma(x_i) + \sigma(-x_i) -2 \sigma(-x_i-1)\right) \\ - \sigma(-x_1 - x_2 + x_3 - 2) - \sigma(x_1 - x_3 - x_4 - 2) \\ - \sigma(-x_2 + x_3 + x_4 - 2) - \sigma(-x_1 + x_2 - x_4 - 2) \ge 4 \]
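As a sanity check on the whole construction, here’s a brute-force sketch (mine, and only over the \(\pm 1\) corners) confirming that this network reaches the threshold 4 exactly on assignments that satisfy the formula:

```python
import itertools

relu = lambda z: max(z, 0.0)

# Each clause is a list of (variable index, polarity) pairs; this is the example formula above.
FORMULA = [
    [(1, +1), (2, +1), (3, -1)],   # (x1  v  x2 v ~x3)
    [(1, -1), (3, +1), (4, +1)],   # (~x1 v  x3 v  x4)
    [(2, +1), (3, -1), (4, -1)],   # (x2  v ~x3 v ~x4)
    [(1, +1), (2, -1), (4, +1)],   # (x1  v ~x2 v  x4)
]

def network_output(x):
    """Output of the 1 layer MLP built from FORMULA; x maps variable index -> real value."""
    out = sum(-2 * relu(x[i] - 1) + relu(x[i]) + relu(-x[i]) - 2 * relu(-x[i] - 1)
              for i in x)                                    # discreteness gadgets
    for clause in FORMULA:
        out -= relu(sum(-p * x[v] for v, p in clause) - 2)   # clause gadgets
    return out

def satisfies(x):
    return all(any(x[v] == p for v, p in clause) for clause in FORMULA)

for bits in itertools.product([-1, 1], repeat=4):
    x = dict(zip(range(1, 5), bits))
    assert (network_output(x) >= 4) == satisfies(x)
```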

Since the discreteness gadgets sum to at most 4 and each clause gadget is at most 0, the output can reach the threshold of 4 if and only if every \(x_i \in \{+1, -1\}\) and every clause gadget equals 0, i.e. if and only if the formula is satisfiable. This concludes the proof. A similar proof can also be made for 2 layer MLPs without biases.